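The conversation scenario is one of the most important and most challenging scenarios for speech processing technology, because people in a conversation respond to each other in a casual style. Detecting the speech activity of each person in a conversation is vital to downstream tasks such as natural language processing and machine translation. The technology for detecting "who spoke when" is referred to as speaker diarization (SD). Traditionally, the diarization error rate (DER) has long been used as the standard evaluation metric for SD systems. However, DER does not give enough weight to short conversational phrases, which are brief but important at the semantic level. In addition, a carefully and accurately hand-annotated test dataset suitable for evaluating conversational SD technology is still unavailable in the speech community. In this paper, we design and describe the Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of training and test datasets, an evaluation metric, and a baseline. On the dataset side, in addition to the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare a separate 20-hour conversational speech test dataset with carefully verified timestamp annotations for the CSSD task. On the metric side, we design the new Conversational DER (CDER) evaluation metric, which computes SD accuracy at the utterance level. On the baseline side, we adopt a commonly used method, the Variational Bayes HMM x-vector system, as the baseline for the CSSD task. Our evaluation metric is publicly available at https://github.com/speechclub/cder_metric.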
translated by 谷歌翻译
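To make the DER limitation described above concrete, here is a minimal sketch using the standard time-weighted DER definition (missed speech plus false alarm plus speaker confusion, divided by total reference speech time). The function name and the example durations are illustrative assumptions; this is not the CDER metric and is not taken from the paper's toolkit.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard time-weighted DER; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical conversation with 600 s of reference speech.
# Case A: a 0.5 s short phrase (e.g. "yes") is assigned to the wrong speaker.
short_phrase_error = diarization_error_rate(0.0, 0.0, 0.5, 600.0)
# Case B: a 30 s turn is assigned to the wrong speaker.
long_turn_error = diarization_error_rate(0.0, 0.0, 30.0, 600.0)

print(f"short phrase: {short_phrase_error:.4%}")
print(f"long turn:    {long_turn_error:.4%}")
```

Each case is a single misattributed utterance, yet DER weights them by duration, so the short phrase contributes 60 times less error than the long turn. An utterance-level metric would count them equally.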
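The cross-domain performance of automatic speech recognition (ASR) can be severely hampered by the mismatch between training and test distributions. Because the target domain usually lacks labeled data and domain shifts exist at both the acoustic and linguistic levels, unsupervised domain adaptation (UDA) for ASR is challenging. Previous work has shown that self-supervised learning (SSL) and pseudo-labeling (PL) are effective for UDA because they exploit self-supervision from unlabeled data. However, these forms of self-supervision also suffer performance degradation under mismatched domain distributions, which previous work fails to address. This work presents a systematic UDA framework that fully utilizes unlabeled data with self-supervision in the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay techniques to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on the PL technique with three distinct modifications. First, we design a dual-branch PL method to reduce sensitivity to erroneous pseudo-labels. Second, we devise an uncertainty-aware confidence filtering strategy to improve pseudo-label correctness. Third, we introduce a two-step PL approach to incorporate target-domain linguistic knowledge, thereby generating more accurate target-domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts cross-domain performance and significantly outperforms previous approaches.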
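Automatic speech recognition (ASR) with federated learning (FL) makes it possible to leverage data from multiple clients without compromising privacy. The quality of FL-based ASR can be measured by recognition performance and by communication and computation costs. When the data across clients are not independent and identically distributed (non-IID), performance can degrade significantly. In this work, we tackle the non-IID issue in FL-based ASR with personalized FL, which learns a personalized model for each client. Concretely, we propose two types of personalized FL approaches for ASR. First, we adapt personalization-layer-based FL to ASR, which keeps some layers local in order to learn personalized models. Second, to reduce communication and computation costs, we propose decoupled federated learning (DecoupleFL). On the one hand, DecoupleFL moves the computation burden to the server, thus reducing computation on the clients. On the other hand, DecoupleFL communicates secure high-level features instead of model parameters, thus reducing communication cost when models are large. Experiments demonstrate that the two proposed personalized FL-based ASR approaches reduce WER by 2.3% to 3.4% compared with FedAvg. Of the two, DecoupleFL requires only 11.4% of the communication cost and 75% of the computation cost of FedAvg, which is also significantly less than personalization-layer-based FL.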
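As a rough illustration of the FedAvg baseline and of the personalization-layer idea mentioned above, the sketch below averages client parameters weighted by local sample counts while leaving a chosen set of layers on each client. The data layout (plain dictionaries of lists), the function names, and the layer names are assumptions made for this example; DecoupleFL is not reproduced here.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]  # layer name -> flat list of weights


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


def personalized_round(client_params: List[Params], client_sizes: List[int],
                       local_layers: Set[str]) -> List[Params]:
    """Average only the shared layers; every client keeps its own local layers."""
    shared = [{k: v for k, v in p.items() if k not in local_layers}
              for p in client_params]
    global_shared = fedavg(shared, client_sizes)
    updated = []
    for p in client_params:
        model = {k: list(v) for k, v in global_shared.items()}  # copy per client
        model.update({k: list(p[k]) for k in local_layers})
        updated.append(model)
    return updated


# Hypothetical two-client example: "encoder" is shared, "head" stays local.
clients = [{"encoder": [1.0, 2.0], "head": [0.0]},
           {"encoder": [3.0, 4.0], "head": [10.0]}]
sizes = [1, 3]
updated = personalized_round(clients, sizes, local_layers={"head"})
print(updated)
```

In this toy setup the shared encoder is pulled toward the client with more data, while each head keeps its client-specific values, which is the intuition behind keeping some layers local under non-IID data.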
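Self-supervised acoustic pre-training has achieved impressive results on the automatic speech recognition (ASR) task. Most successful acoustic pre-training methods use contrastive learning to learn acoustic representations by distinguishing representations from different time steps, ignoring robustness to speaker and environment. As a result, the pre-trained model can perform poorly on out-of-domain data during fine-tuning. In this letter, we design a novel consistency contrastive learning (CCL) method that uses data augmentation for acoustic pre-training. Different kinds of augmentation are applied to the original audio, and the augmented audios are then fed into an encoder. The encoder should not only contrast the representations within one audio but also maximize the similarity measure between the representations of differently augmented audios. In this way, the pre-trained model learns a text-related representation that is more robust to changes in speaker or environment. Experiments show that applying the CCL method to wav2vec 2.0 yields better results on both in-domain and out-of-domain data. For noisy out-of-domain data in particular, a relative improvement of more than 15% is obtained.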
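The contrastive objective referred to above can be sketched with a generic InfoNCE loss over cosine similarities. This is a textbook formulation, not the exact CCL loss: the function names, the temperature value, and the choice to treat two augmented views of the same frame as anchor and positive are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-norm vector")
    return dot / norm


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.1) -> float:
    """InfoNCE loss: negative log-softmax score of the positive candidate."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D representations: the same frame under two augmentations
# (anchor, consistent_view) versus frames from other time steps (negatives).
anchor = [1.0, 0.0]
consistent_view = [0.9, 0.1]
inconsistent_view = [-1.0, 0.2]
negatives = [[0.0, 1.0], [-0.5, -0.5]]

print(info_nce(anchor, consistent_view, negatives))
print(info_nce(anchor, inconsistent_view, negatives))
```

The loss is small when the two augmented views agree and large when they diverge, so minimizing it encourages representations that stay stable under augmentation.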
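Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training is task-agnostic, i.e., it can be applied to various downstream tasks. Although this broadens its scope of application, the capacity of the pre-trained model is not fully used for the ASR task, and the learned representations may not be optimal for ASR. In this work, to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, in which task-specific semi-supervised pre-training refines the self-supervised pre-trained model for the ASR task, so that the capacity of the pre-trained model is used more effectively to generate ASR-specific representations. Experiments show that, compared with wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time but significantly improves ASR performance on in-domain, cross-domain, and cross-lingual datasets, with gains of 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, using canonical correlation analysis, we show that semi-supervised pre-training closes the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model.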
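Glass-like objects such as windows, bottles, and mirrors are widespread in the real world. Sensing these objects has many applications, including robot navigation and grasping. However, the task is very challenging because of the arbitrary scenes behind glass-like objects. This paper aims to solve the glass-like object segmentation problem through enhanced boundary learning. In particular, we first propose a novel refined differential module that outputs finer boundary cues. We then introduce an edge-aware point-based graph convolution network module to model the global shape along the boundary. We use these two modules to design a decoder that produces accurate and clean segmentation results, especially along object contours. Both modules are lightweight and effective, and they can be embedded into various segmentation models. In extensive experiments on three recent glass-like object segmentation datasets, Trans10K, MSD, and GDD, our approach establishes new state-of-the-art results. We also demonstrate the strong generalization of our method on three generic segmentation datasets: Cityscapes, BDD, and COCO Stuff. Code and models are available at https://github.com/hehao13/ebrnet.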
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works handle thing, stuff, and part predictions with separate approaches, without shared computation or task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. Firstly, we design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. Secondly, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors of part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++, with a new part-whole interaction scheme based on masked cross attention that further boosts part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and a 5% PartPQ improvement on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost reduction of 70% in GFLOPs and 50% in parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
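For readers unfamiliar with the PQ metric that PartPQ is said to be biased toward, the standard Panoptic Quality definition is sketched below. The set names $\mathit{TP}$, $\mathit{FP}$, $\mathit{FN}$ follow common usage and are not taken from this abstract; the proposed PWQ metric is not reproduced here.

```latex
% Panoptic Quality for one class: predicted segment p and ground-truth segment g
% form a true-positive pair when IoU(p, g) > 0.5.
\mathrm{PQ}
  = \frac{\sum_{(p, g) \in \mathit{TP}} \mathrm{IoU}(p, g)}
         {|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
  = \underbrace{\frac{\sum_{(p, g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}|}}_{\text{segmentation quality}}
    \times
    \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{recognition quality}}
```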
Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly protect either a single ranking within a set of rankings or the pairwise comparisons of a ranking under $\epsilon$-differential privacy. This paper proposes a novel notion called $\epsilon$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $\epsilon$-ranking differential privacy. This allows us to develop a multistage ranking algorithm that generates synthetic rankings while satisfying $\epsilon$-ranking differential privacy. We establish theoretical results on the utility of synthetic rankings in downstream tasks, including the inference attack and the personalized ranking task. For the inference attack, we quantify how $\epsilon$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.
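For reference, the Mallows model mentioned above is commonly written in the following textbook form with the Kendall tau distance. The symbols $\pi_0$ (central ranking), $\phi$ (dispersion), $d$, and $m$ (number of items) are notation assumed here rather than taken from the paper. The standard $\epsilon$-differential privacy condition is included only for comparison; the paper's $\epsilon$-ranking variant is not reproduced.

```latex
% Mallows model: rankings closer to the central ranking \pi_0 are more likely.
P(\pi \mid \pi_0, \phi) = \frac{\phi^{\, d(\pi, \pi_0)}}{\psi(\phi)},
\qquad \phi \in (0, 1],
\qquad
\psi(\phi) = \sum_{\sigma \in S_m} \phi^{\, d(\sigma, \pi_0)}
           = \prod_{j=1}^{m} \sum_{k=0}^{j-1} \phi^{k}

% Kendall tau distance: number of item pairs ordered differently by \pi and \pi_0.
d(\pi, \pi_0) = \sum_{i < j}
  \mathbb{1}\!\left[ \big(\pi(i) - \pi(j)\big)\big(\pi_0(i) - \pi_0(j)\big) < 0 \right]

% Standard \epsilon-differential privacy of a randomized mechanism \mathcal{M},
% for all neighbouring inputs D, D' and all measurable output sets S.
\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S]
```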
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods overlook two important issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also introduce reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate these limitations, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sparsely sampled frames with the missing consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
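The boundary-bias issue described above can be illustrated with a small sketch of uniform sparse sampling. The frame counts, the annotated boundaries, and the function names are hypothetical values chosen for this example; the siamese sampling mechanism itself is not reproduced.

```python
from typing import List


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_samples < 2 or num_frames < num_samples:
        raise ValueError("need 2 <= num_samples <= num_frames")
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def snap_to_sampled(frame: int, sampled: List[int]) -> int:
    """Return the sampled frame closest to an annotated boundary frame."""
    return min(sampled, key=lambda s: abs(s - frame))


# Hypothetical video of 3000 frames, sparsely sampled down to 64 frames.
sampled = uniform_sample_indices(num_frames=3000, num_samples=64)
start, end = 1234, 1711  # hypothetical annotated boundary frames
new_start = snap_to_sampled(start, sampled)
new_end = snap_to_sampled(end, sampled)

print(f"annotated boundaries: ({start}, {end})")
print(f"boundaries after sampling: ({new_start}, {new_end})")
```

Because neither annotated frame survives the downsampling, the model only ever sees neighbouring frames as boundaries, which is the mismatch that additional contextual frames are meant to compensate for.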